
pytorch-triton-rocm from pytorch-nightly install does not run any more #396

Closed
briansp2020 opened this issue Nov 12, 2023 · 22 comments

@briansp2020

I noticed that the Triton build installed along with the PyTorch nightly build was recently updated to 2.1.0+e8a35b3968 and it no longer runs. It seems the way Triton accesses the hardware has changed. Is there a guide that explains how to enable the HIP backend for Triton?

$ python tmp/02-fused-softmax.py
Traceback (most recent call last):
File "/root/tmp/02-fused-softmax.py", line 143, in
y_triton = softmax(x)
File "/root/tmp/02-fused-softmax.py", line 121, in softmax
softmax_kernel[(n_rows,)](
File "", line 78, in softmax_kernel
RuntimeError: Triton Error [HIP]: Code: 719, Messsage: unspecified launch failure
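
For anyone who wants a smaller repro than the tutorial script, a stripped-down launch like the sketch below (a standard vector-add kernel, not the tutorial's softmax; the names are only illustrative) should exercise the same HIP kernel-launch path:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(4096, device="cuda")   # "cuda" maps to the ROCm/HIP device in these builds
y = torch.rand(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
torch.cuda.synchronize()              # a failing launch typically surfaces here
print(torch.allclose(out, x + y))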

@zhanglx13

@jataylo Do you have any comments on this?
We usually install Triton from source rather than using the pre-installed Triton from PyTorch.

@jataylo

jataylo commented Nov 15, 2023

@briansp2020 Thanks for reporting this; I will attempt to replicate with the latest nightly.

In the meantime I recommend installing from source from the https://github.com/ROCmSoftwarePlatform/triton/tree/pytorch_nightly_11-03-2023 branch. This is the same commit used in PyTorch, so the failure may be exclusively a wheel issue. Alternatively, you could try building our latest Triton from the triton-mlir branch, but you may see incompatibilities with PyTorch.

@jataylo jataylo self-assigned this Nov 15, 2023
@jataylo

jataylo commented Nov 15, 2023

Hey @briansp2020, I am unable to replicate this failure by following the nightly install instructions here:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7

And the softmax test is also passing for me on our latest nightly image

rocm/pytorch-nightly:latest

Could you help me with the details of your environment setup, e.g. which hardware, and whether it is a Docker or bare-metal environment?
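
If it helps, a small snippet along these lines (a sketch that only touches standard torch/triton attributes) can capture most of the software-side details in one go:

import torch
import triton

# Versions of the stack as seen from inside the (virtual) environment.
print("torch      :", torch.__version__)
print("HIP runtime:", torch.version.hip)
print("triton     :", triton.__version__)

# The GPU PyTorch actually sees.
print("device     :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")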

@briansp2020

briansp2020 commented Nov 15, 2023

Just installed again today and I still see the problem.

pip list
Package Version


certifi 2022.12.7
charset-normalizer 2.1.1
filelock 3.9.0
fsspec 2023.4.0
idna 3.4
Jinja2 3.1.2
MarkupSafe 2.1.3
mpmath 1.2.1
networkx 3.0rc1
numpy 1.24.1
Pillow 9.3.0
pip 22.0.2
pytorch-triton-rocm 2.1.0+e8a35b3968
requests 2.28.1
setuptools 59.6.0
sympy 1.11.1
torch 2.2.0.dev20231115+rocm5.7
torchaudio 2.2.0.dev20231115+rocm5.7
torchvision 0.17.0.dev20231115+rocm5.7
typing_extensions 4.8.0
urllib3 1.26.13

I just created a new venv and ran the pip command. It's a Docker image I built using https://gist.github.com/briansp2020/fd1579b3d7fe4643409593e229fbd26f

The hardware is a Ryzen 9 7900X, Radeon 7900XTX, ASUS Strix B650E-F, 64GB RAM. The base OS is Ubuntu 22.04 Server. I just pulled rocm/pytorch-nightly:latest and it is showing the same error. I'll try installing from the https://github.com/ROCmSoftwarePlatform/triton/tree/pytorch_nightly_11-03-2023 branch.

In the past, before pytorch-triton-rocm 2.1.0+e8a35b3968, when I installed from source after installing pytorch-nightly, pip list would show both pytorch-triton-rocm and triton. Is that expected behavior?

@jataylo

jataylo commented Nov 15, 2023

@briansp2020 Thanks for the detailed information.

I have reproduced your issue on Navi31 at PyTorch's pinned Triton commit. I have also confirmed that this workload passes with the latest commit of the triton-mlir branch.

If you want to get around this for now, I recommend building the triton-mlir branch from source, as it will take us a bit of time to get the PyTorch commit back in sync.

In the past, before pytorch-triton-rocm 2.1.0+e8a35b3968, when I installed from source after installing pytorch-nightly, pip list would show both pytorch-triton-rocm and triton. Is that expected behavior?

This is expected: pytorch-triton-rocm is the name of the Triton wheel that is installed as a dependency of PyTorch, whereas Triton built from source is simply named triton. I recommend having only one of these installed at a time.
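
A quick way to check which Triton distributions are actually installed in a given environment (a sketch using only the standard library):

import importlib.metadata as md

# Only one of these should be present at a time; having both installed is a
# common cause of confusing behaviour.
for dist in ("pytorch-triton-rocm", "triton"):
    try:
        print(dist, md.version(dist))
    except md.PackageNotFoundError:
        print(dist, "not installed")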

@briansp2020

@jataylo
Thanks for the quick response. I'm trying to install from source and it does not seem to work, and I'm not sure what I'm doing wrong.
If I install on top of the pytorch-nightly install, it does not seem to do anything (i.e. I get the same error). If I uninstall pytorch-triton-rocm 2.1.0 before installing from source, the installation seems broken and I get the following error.

(pt) root@rocm:/triton# pip install -e python
Obtaining file:///root/triton/python
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Preparing editable metadata (pyproject.toml) ... done
Requirement already satisfied: filelock in /root/pt/lib/python3.10/site-packages (from triton==2.1.0) (3.9.0)
Building wheels for collected packages: triton
Building editable for triton (pyproject.toml) ... done
Created wheel for triton: filename=triton-2.1.0-0.editable-cp310-cp310-linux_x86_64.whl size=3117 sha256=bd09a4849375d33546dd7f6bc1ad9c24a139eaf93123f51d2cbfec5a0819be2d
Stored in directory: /tmp/pip-ephem-wheel-cache-9x209ooe/wheels/35/45/3e/fc4d040c898a9a79bc55075a9dfa7801b83564de73a8db92ed
Successfully built triton
Installing collected packages: triton
Successfully installed triton-2.1.0
(pt) root@rocm:/triton# cd
(pt) root@rocm:~# python tmp/02-fused-softmax.py
Traceback (most recent call last):
File "/root/tmp/02-fused-softmax.py", line 73, in
@triton.jit
AttributeError: module 'triton' has no attribute 'jit'

@jataylo

jataylo commented Nov 16, 2023

@briansp2020 I believe you are trying to build from the wrong directory; can you try this out?

cd triton/python && python setup.py develop

@briansp2020

OK, that seems to have installed it. Triton still does not seem to work on the 7900XTX though, as I got a kernel page fault message. :(

The problem was that I was running "pip3 install -e ." as instructed on the main page instead of "python setup.py develop".
The main page should be updated with proper installation instructions.
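
For reference, a quick sanity check (just a sketch) that the editable install is the copy actually being imported and that it exposes the JIT decorator:

import triton

# If the editable install worked, __file__ points into the cloned source tree
# and triton.jit exists; the broken install above imported a bare package
# without the jit attribute.
print(triton.__file__)
print("has jit:", hasattr(triton, "jit"))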

@jataylo

jataylo commented Nov 16, 2023

OK, that seems to have installed it. Triton still does not seem to work on the 7900XTX though, as I got a kernel page fault message. :(

@briansp2020 Was this using the softmax example again? I did run this yesterday on a 7900XTX (I hit a tolerance error, but it did execute at least). If you have reinstalled into a pre-existing environment, could you try clearing the Triton cache with

rm -rf ~/.triton

Please let me know if this still fails, along with any additional repro information.

@briansp2020

The kernel panic with the 7900XTX was a known issue (ROCm/pytorch#1284, ROCm/AMDMIGraphX#2174 (comment), #223 (comment), ...).

I rebooted the system and started a fresh Docker container and still got the page fault. dmesg shows:

[ 2749.139652] [drm] PCIE GART of 512M enabled (table at 0x00000085FEB00000).
[ 2749.139670] [drm] PSP is resuming...
[ 2749.199638] [drm] reserve 0x1300000 from 0x85fc000000 for PSP TMR
[ 2749.340972] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 2749.340973] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 2749.340974] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[ 2749.340976] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000003d, smu fw if version = 0x0000003f, smu fw program = 0, smu fw version = 0x004e7300 (78.115.0)
[ 2749.340978] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[ 2749.491877] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[ 2749.493930] [drm] DMUB hardware initialized: version=0x07002100
[ 2749.501321] [drm] kiq ring mec 3 pipe 1 q 0
[ 2749.506521] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 2749.506699] amdgpu 0000:03:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 2749.506929] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 2749.506930] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 2749.506931] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 2749.506931] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 2749.506932] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 2749.506932] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 2749.506933] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 2749.506933] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 2749.506934] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 2749.506934] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 2749.506934] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 2749.506935] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 2749.506935] amdgpu 0000:03:00.0: amdgpu: ring vcn_unified_1 uses VM inv eng 1 on hub 8
[ 2749.506935] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 4 on hub 8
[ 2749.506936] amdgpu 0000:03:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 14 on hub 0
[ 2749.509825] amdgpu 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 2749.510938] [drm] ring gfx_32770.1.1 was added
[ 2749.511224] [drm] ring compute_32770.2.2 was added
[ 2749.511432] [drm] ring sdma_32770.3.3 was added
[ 2749.511454] [drm] ring gfx_32770.1.1 ib test pass
[ 2749.511470] [drm] ring compute_32770.2.2 ib test pass
[ 2749.511543] [drm] ring sdma_32770.3.3 ib test pass
[ 2750.051882] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dcd000 flags=0x0020]
[ 2750.152468] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 2750.152476] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dcd000 flags=0x0000]
[ 2750.152544] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 2750.152551] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dcd000 flags=0x0020]
[ 2750.152622] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 2750.152685] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 2750.152743] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 2750.152789] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
[ 2750.152836] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 2750.152886] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 2750.152934] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[ 2750.183365] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dcd000 flags=0x0020]
[ 2750.284453] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 2750.284456] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dcd000 flags=0x0000]
[ 2750.284523] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 2750.284530] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dcd000 flags=0x0020]
[ 2750.284602] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 2750.284665] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 2750.284723] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 2750.284769] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x1
[ 2750.284816] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 2750.284866] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 2750.284914] amdgpu 0000:03:00.0: amdgpu: RW: 0x0

Currently, I have the 6.0 dkms driver installed on the host and am running a ROCm 5.7.1 runtime environment in a Docker container. So, it could be caused by a mismatch between user-land code and the kernel module. At the moment, I'm a bit reluctant to go back to the 5.7.1 kernel module, since my attempts to downgrade the kernel module have failed a few times in the past. I hope I have provided enough information for you to track it down.
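
A rough way to compare the two sides (a sketch; the torch part runs inside the container, while modinfo/dkms need to be run on the host and may report differently across ROCm versions):

import subprocess
import torch

# User-space side: PyTorch build and the HIP runtime it was built against.
print("torch:", torch.__version__, "| HIP runtime:", torch.version.hip)

# Kernel side (run on the host, not in the container): amdgpu module / dkms status.
for cmd in (["modinfo", "-F", "version", "amdgpu"], ["dkms", "status"]):
    try:
        out = subprocess.run(cmd, capture_output=True, text=True).stdout.strip()
        print(" ".join(cmd), "->", out or "(no output)")
    except FileNotFoundError:
        print(" ".join(cmd), "-> not available here")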

@jataylo

jataylo commented Nov 21, 2023

@briansp2020

Currently, I have the 6.0 dkms driver installed on the host and am running a ROCm 5.7.1 runtime environment in a Docker container. So, it could be caused by a mismatch between user-land code and the kernel module.

This seems likely; I was able to get this working with a 5.7.1 host, if you can give that a try. In the meantime I'm working on getting the PyTorch Triton commit back in sync with triton-mlir to resolve this issue with the PyTorch Triton wheel. Will keep you posted.

@briansp2020

I reinstalled Ubuntu Server 22.04 and then installed and built a new Docker image with ROCm 5.7.1. The kernel error message still happens. The problem is very consistent and reproducible for me. If you don't mind, could you tell me what hardware you are using for 7900XTX testing?

[ 449.782858] gmc_v11_0_process_interrupt: 22 callbacks suppressed
[ 449.782862] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 449.782890] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 449.782905] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 449.782917] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 449.782929] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 449.782938] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 449.782947] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 449.782956] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 449.782965] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[ 449.782987] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77ded000 flags=0x0000]
[ 449.783005] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77ded000 flags=0x0020]
[ 449.816134] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77ded000 flags=0x0000]
[ 449.831207] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 449.831231] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 449.831241] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 449.831250] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 449.831258] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 449.831264] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 449.831270] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 449.831276] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 449.831283] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[ 449.831298] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dd4000 flags=0x0000]
[ 449.867498] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dd4000 flags=0x0000]
[ 449.867530] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dd4000 flags=0x0020]
[ 449.934793] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dbd000 flags=0x0000]
[ 449.934821] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dbd000 flags=0x0020]
[ 450.033793] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dbd000 flags=0x0020]
[ 450.134813] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 450.134846] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 450.134866] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 450.134882] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 450.134896] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 450.134908] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 450.134919] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 450.134932] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 450.134943] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
[ 450.134964] amdgpu 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x77dbd000 flags=0x0000]
[ 450.282997] amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 450.283026] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 450.283044] amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B3A
[ 450.283058] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 450.283071] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
[ 450.283081] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x5
[ 450.283091] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 450.283102] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 450.283112] amdgpu 0000:03:00.0: amdgpu: RW: 0x0

@briansp2020

@jataylo
Also, what kernel version are you using? I'm running Linux 5.15.0-89-generic, which seems to be the default on Ubuntu Server. I had issues with 6.2, which is the default on the Ubuntu 22.04 desktop.

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this issue Nov 27, 2023
Small bump in rocm triton commit pin to resolve reported issue on 7900XTX
> RuntimeError: Triton Error [HIP]: Code: 719, Messsage: unspecified launch failure
ROCm/triton#396

Pull Request resolved: #114348
Approved by: https://github.com/jeffdaily
@jataylo

jataylo commented Nov 28, 2023

Hi @briansp2020, the kernel error was still present with the Triton commit pinned in torch; this should be resolved when the next nightlies are updated later today. For reference, we are now pinning PyTorch nightly to this branch: https://github.com/ROCmSoftwarePlatform/triton/tree/pytorch_nightly/23_11_2023.

I was just checking over your log from earlier
https://gist.github.com/briansp2020/b3c10b59c42d8cd5b612a9624d3fed8e

AssertionError: (tensor([[3.1729e-04, 5.2277e-04, 5.6863e-05,  ..., 5.3765e-04, 1.9698e-04,
         9.1166e-04],
        [8.6817e-04, 1.1062e-03, 1.0191e-04,  ..., 1.1457e-04, 5.5374e-04,
         8.5950e-04],
        [1.4239e-03, 1.0614e-03, 1.7198e-03,  ..., 1.2392e-03, 7.9831e-04,
         3.2696e-04],
        ...,
        [8.8930e-04, 1.0401e-03, 2.1141e-04,  ..., 1.3419e-03, 2.8997e-04,
         4.1784e-04],
        [1.0344e-03, 4.7134e-04, 1.8638e-03,  ..., 2.9375e-03, 3.1842e-04,
         9.1934e-04],
        [3.5456e-04, 7.5649e-04, 6.9392e-04,  ..., 1.0075e-04, 6.3662e-04,
         1.0441e-03]], device='cuda:0'), tensor([[3.2747e-04, 5.3954e-04, 5.8687e-05,  ..., 5.5490e-04, 2.0330e-04,
         9.4092e-04],
        [8.6914e-04, 1.1075e-03, 1.0202e-04,  ..., 1.1470e-04, 5.5436e-04,
         8.6046e-04],
        [1.4101e-03, 1.0511e-03, 1.7031e-03,  ..., 1.2272e-03, 7.9059e-04,
         3.2380e-04],
        ...,
        [8.8591e-04, 1.0361e-03, 2.1061e-04,  ..., 1.3368e-03, 2.8886e-04,
         4.1625e-04],
        [1.0660e-03, 4.8571e-04, 1.9206e-03,  ..., 3.0271e-03, 3.2813e-04,
         9.4738e-04],
        [3.5291e-04, 7.5298e-04, 6.9070e-04,  ..., 1.0028e-04, 6.3366e-04,
         1.0393e-03]], device='cuda:0'))

I also see this AssertionError on Navi31 on my side, so it seems we are at the same point; it is likely a tolerance issue. cc: @zhanglx13

P.S. I'm also using Linux 5.15.0-89-generic in an Ubuntu Server environment.
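
One way to confirm whether this is purely a tolerance issue rather than a functional one (a sketch; y_triton and y_torch are stand-ins for the two tensors printed in the assertion above):

import torch

# The tutorial-style check compares the two outputs with torch.allclose at its
# default tolerances (rtol=1e-5, atol=1e-8); re-checking with looser bounds
# shows whether the outputs merely drift numerically or are actually wrong.
print(torch.allclose(y_triton, y_torch))
print(torch.allclose(y_triton, y_torch, rtol=1e-3, atol=1e-3))
print("max abs diff:", (y_triton - y_torch).abs().max().item())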

@briansp2020

@jataylo
Thank you for the fix. I just installed the ROCm 6 beta kernel driver (it seems to have been updated yesterday) and the PyTorch nightly (pytorch-triton-rocm 2.1.0+dafe145982), and verified that the page fault no longer happens.

Since you can reproduce the assertion error, I'll wait till it gets resolved.

I also want to bring to your attention that 03-matrix-multiplication is showing very poor performance. I saw the same issue with MI100 as well, so I don't know whether it's a Navi31-specific issue or an issue with Triton's ROCm support in general.

python 03-matrix-multiplication.py
matmul-performance:
M N K cuBLAS Triton
0 256.0 256.0 256.0 3.524625 1.302579
1 384.0 384.0 384.0 10.485760 3.181073
2 512.0 512.0 512.0 19.228901 4.279902
3 640.0 640.0 640.0 30.768076 4.754110
4 768.0 768.0 768.0 38.519118 6.748830
5 896.0 896.0 896.0 47.954877 5.833903
6 1024.0 1024.0 1024.0 42.692663 6.379890
7 1152.0 1152.0 1152.0 60.956671 6.933394
8 1280.0 1280.0 1280.0 59.341324 6.687304
9 1408.0 1408.0 1408.0 64.793626 6.670066
10 1536.0 1536.0 1536.0 73.655320 7.195896
11 1664.0 1664.0 1664.0 75.309013 5.936619
12 1792.0 1792.0 1792.0 73.870966 6.225164
13 1920.0 1920.0 1920.0 83.741676 6.486762
14 2048.0 2048.0 2048.0 82.658311 6.546061
15 2176.0 2176.0 2176.0 80.803289 6.544062
16 2304.0 2304.0 2304.0 91.779222 6.860620
17 2432.0 2432.0 2432.0 91.022283 6.726674
18 2560.0 2560.0 2560.0 89.069244 6.840864
19 2688.0 2688.0 2688.0 89.024428 6.912130
20 2816.0 2816.0 2816.0 89.521712 6.948294
21 2944.0 2944.0 2944.0 89.191222 6.959370
22 3072.0 3072.0 3072.0 88.131293 7.147805
23 3200.0 3200.0 3200.0 87.778561 6.971809
24 3328.0 3328.0 3328.0 87.866782 7.024228
25 3456.0 3456.0 3456.0 88.164825 7.049201
26 3584.0 3584.0 3584.0 88.903976 7.144980
27 3712.0 3712.0 3712.0 76.510066 7.068753
28 3840.0 3840.0 3840.0 80.079431 7.128153
29 3968.0 3968.0 3968.0 78.901346 7.175697
30 4096.0 4096.0 4096.0 82.087560 7.204888

I bought an MI100 thinking that it would be in better shape, but Triton support does not seem much better on MI100. I wish I had access to the MI200 series so I could test the software. I hope MI200 & MI300 are better supported.
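
As a side note, a standalone measurement like the sketch below (fp16 inputs, using triton.testing.do_bench, which reports time in milliseconds) gives a rough rocBLAS-side TFLOP/s number for a single size, independent of the tutorial's benchmark harness:

import torch
import triton

M = N = K = 4096
a = torch.randn((M, K), device="cuda", dtype=torch.float16)
b = torch.randn((K, N), device="cuda", dtype=torch.float16)

# do_bench warms up and times the callable, returning milliseconds.
ms = triton.testing.do_bench(lambda: torch.matmul(a, b))
tflops = 2 * M * N * K * 1e-12 / (ms * 1e-3)
print(f"torch.matmul (rocBLAS): {ms:.3f} ms, {tflops:.1f} TFLOP/s")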

@scxiao

scxiao commented Dec 1, 2023

Hi @briansp2020, the MI100 has much better performance than what you got above; see what we get internally:

python 03-matrix-multiplication.py

matmul-performance:
M N K rocBLAS Triton
0 1024.0 1024.0 1024.0 53.050487 37.386553
1 2048.0 2048.0 2048.0 108.022314 59.224588
2 4096.0 4096.0 4096.0 124.239614 84.426004
3 8192.0 8192.0 8192.0 124.371145 86.070328
4 9728.0 8192.0 65536.0 71.238086 75.257901

@briansp2020

@scxiao
Sorry for the confusion. The numbers I posted are from the 7900XTX. Below is what I'm getting for MI100; it seems close enough to the numbers you posted. I think I was expecting Triton matmul to be closer to BLAS, not ~70% of BLAS. If what you posted is the performance you expect, I don't have any problem with MI100.

python 03-matrix-multiplication.py
matmul-performance:
M N K cuBLAS Triton
0 256.0 256.0 256.0 2.056031 1.906502
1 384.0 384.0 384.0 5.801548 5.055634
2 512.0 512.0 512.0 12.157403 10.046237
3 640.0 640.0 640.0 20.608805 16.890722
4 768.0 768.0 768.0 30.118672 22.292560
5 896.0 896.0 896.0 43.648247 31.884890
6 1024.0 1024.0 1024.0 53.473198 38.679459
7 1152.0 1152.0 1152.0 67.289781 48.380503
8 1280.0 1280.0 1280.0 68.804201 54.956813
9 1408.0 1408.0 1408.0 81.712803 57.767164
10 1536.0 1536.0 1536.0 93.015369 64.619806
11 1664.0 1664.0 1664.0 87.263002 73.837229
12 1792.0 1792.0 1792.0 101.170622 72.294237
13 1920.0 1920.0 1920.0 112.705227 80.067060
14 2048.0 2048.0 2048.0 107.266912 57.373328
15 2176.0 2176.0 2176.0 65.476238 71.970824
16 2304.0 2304.0 2304.0 119.532744 77.467764
17 2432.0 2432.0 2432.0 72.414243 75.532365
18 2560.0 2560.0 2560.0 79.694257 81.033888
19 2688.0 2688.0 2688.0 87.233957 78.834893
20 2816.0 2816.0 2816.0 106.946715 77.386032
21 2944.0 2944.0 2944.0 112.267017 78.889708
22 3072.0 3072.0 3072.0 118.350294 76.842315
23 3200.0 3200.0 3200.0 94.031354 78.746607
24 3328.0 3328.0 3328.0 102.821901 80.704994
25 3456.0 3456.0 3456.0 107.361364 64.096722
26 3584.0 3584.0 3584.0 113.614851 67.581790
27 3712.0 3712.0 3712.0 117.203254 68.189406
28 3840.0 3840.0 3840.0 125.394563 71.642235
29 3968.0 3968.0 3968.0 81.273201 68.408729
30 4096.0 4096.0 4096.0 104.704381 61.269205

@scxiao

scxiao commented Dec 1, 2023

Thanks for the reply. We are aware of the performance gap and agree it can be better (compared to both rocBLAS, as you mentioned here, and the Triton MI200 numbers we have). We have an internal issue to track it and are working on it. Will let you know if there is any update.

@briansp2020

@scxiao
Thank you. If you don't mind, could you post numbers for MI200? I would love to test it to find out the MI200 performance, but I don't have access to one.

Thank you!

@scxiao

scxiao commented Dec 1, 2023

Here are the numbers on 1 GCD of MI200:

    M       N        K     rocBLAS      Triton

0 1024.0 1024.0 1024.0 60.732002 49.710268
1 2048.0 2048.0 2048.0 107.267251 100.632439
2 4096.0 4096.0 4096.0 131.204889 123.525760
3 8192.0 8192.0 8192.0 124.852141 121.615166
4 9728.0 8192.0 65536.0 111.380459 110.469548

@briansp2020

briansp2020 commented Dec 16, 2023

I just tried the ROCm 6.0 release with a nightly build of PyTorch and a locally built Triton, and I still see the page fault error message when running 02-fused-softmax.py.

[ 1324.874369] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xb7ded000 flags=0x0000]
[ 1324.874462] amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 1324.874505] amdgpu 0000:09:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 1324.874532] amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 1324.874552] amdgpu 0000:09:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 1324.874572] amdgpu 0000:09:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1324.874588] amdgpu 0000:09:00.0: amdgpu: WALKER_ERROR: 0x1
[ 1324.874604] amdgpu 0000:09:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 1324.874621] amdgpu 0000:09:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 1324.874637] amdgpu 0000:09:00.0: amdgpu: RW: 0x0
[ 1324.975576] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xfefd9000 flags=0x0000]
[ 1324.975579] amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 1324.975611] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xfefd9000 flags=0x0020]
[ 1324.975638] amdgpu 0000:09:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 1324.975687] amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 1324.975706] amdgpu 0000:09:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 1324.975723] amdgpu 0000:09:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1324.975737] amdgpu 0000:09:00.0: amdgpu: WALKER_ERROR: 0x1
[ 1324.975751] amdgpu 0000:09:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 1324.975765] amdgpu 0000:09:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 1324.975779] amdgpu 0000:09:00.0: amdgpu: RW: 0x0
[ 1325.244953] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xfefd9000 flags=0x0020]
[ 1325.347609] amdgpu 0000:09:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:153 vmid:0 pasid:0, for process pid 0 thread pid 0)
[ 1325.347636] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xfefd9000 flags=0x0000]
[ 1325.347648] amdgpu 0000:09:00.0: amdgpu: in page starting at address 0x0000000000000000 from client 10
[ 1325.347681] amdgpu 0000:09:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0013 address=0xfefd9000 flags=0x0020]
[ 1325.347700] amdgpu 0000:09:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000B32
[ 1325.347742] amdgpu 0000:09:00.0: amdgpu: Faulty UTCL2 client ID: CPC (0x5)
[ 1325.347759] amdgpu 0000:09:00.0: amdgpu: MORE_FAULTS: 0x0
[ 1325.347773] amdgpu 0000:09:00.0: amdgpu: WALKER_ERROR: 0x1
[ 1325.347786] amdgpu 0000:09:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[ 1325.347801] amdgpu 0000:09:00.0: amdgpu: MAPPING_ERROR: 0x1
[ 1325.347815] amdgpu 0000:09:00.0: amdgpu: RW: 0x0

@briansp2020

briansp2020 commented May 3, 2024

I just tried the latest PyTorch nightly build and ROCm 6.1, and it seems to work much better. I'll close this now.

(pt) root@rocm:~# python tmp/02-fused-softmax.py
softmax-performance:
          N      Triton  Torch (native)  Torch (jit)
0     256.0  399.457529      362.194621   102.248970
1     384.0  432.090651      352.255309   116.593734
2     512.0  434.642885      290.862078   116.507630
3     640.0  419.766225      305.346741   237.446586
4     768.0  418.036955      297.043523   225.942583
..      ...         ...             ...          ...
93  12160.0  635.493504      636.264744   257.832348
94  12288.0  646.015177      652.125886   261.388774
95  12416.0  644.146583      648.851096   260.235986
96  12544.0  644.299467      648.405574   261.121662
97  12672.0  645.369805      650.567489   260.931751

[98 rows x 4 columns]
(pt) root@rocm:~# python tmp/03-matrix-multiplication.py
matmul-performance:
         M       N       K     cuBLAS     Triton
0    256.0   256.0   256.0   3.480750   3.423922
1    384.0   384.0   384.0  10.294174   9.695737
2    512.0   512.0   512.0  18.436501  17.613875
3    640.0   640.0   640.0  29.925115  27.886176
4    768.0   768.0   768.0  38.194337  32.035703
5    896.0   896.0   896.0  47.323890  41.245591
6   1024.0  1024.0  1024.0  42.106696  43.365111
7   1152.0  1152.0  1152.0  59.718513  54.097551
8   1280.0  1280.0  1280.0  60.262123  57.803831
9   1408.0  1408.0  1408.0  68.046689  53.534379
10  1536.0  1536.0  1536.0  74.981194  58.962245
11  1664.0  1664.0  1664.0  74.674320  60.783406
12  1792.0  1792.0  1792.0  73.756877  63.250405
13  1920.0  1920.0  1920.0  82.723297  66.433155
14  2048.0  2048.0  2048.0  82.151983  66.146890
15  2176.0  2176.0  2176.0  80.018544  65.350404
16  2304.0  2304.0  2304.0  90.334664  67.608744
17  2432.0  2432.0  2432.0  90.284727  66.055139
18  2560.0  2560.0  2560.0  88.476923  68.243480
19  2688.0  2688.0  2688.0  88.231240  67.137226
20  2816.0  2816.0  2816.0  88.971880  66.520229
21  2944.0  2944.0  2944.0  88.320087  66.250415
22  3072.0  3072.0  3072.0  86.751499  68.053102
23  3200.0  3200.0  3200.0  86.549304  66.175656
24  3328.0  3328.0  3328.0  87.335369  56.081508
25  3456.0  3456.0  3456.0  88.094691  59.392300
26  3584.0  3584.0  3584.0  88.906944  60.402034
27  3712.0  3712.0  3712.0  77.511448  60.383596
28  3840.0  3840.0  3840.0  79.673419  62.019872
29  3968.0  3968.0  3968.0  78.949798  60.995023
30  4096.0  4096.0  4096.0  82.309423  64.113827
