HIP error: invalid device function #6656

Open
wovynn opened this issue Jan 31, 2025 · 2 comments
Labels
Potential Bug User is reporting a bug. This should be tested.

wovynn commented Jan 31, 2025

Expected Behavior

Render something

Actual Behavior

Fails to queue anything including default workflow.

Steps to Reproduce

Step 1. Load default workflow
Step 2. Queue
Step 3. Failure

Debug Logs

GPU is an RX 7800 XT. I tried both the stable and unstable versions of ROCm (6.2 and 6.3).

This is very confusing, because the error is usually related to unsupported GPU features, but my GPU is RDNA3, which as far as I know is fully supported by current ROCm.

Here is the output of rocminfo:

=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 7 7800X3D 8-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 7 7800X3D 8-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5050                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    31948540(0x1e77efc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    31948540(0x1e77efc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    31948540(0x1e77efc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1101                            
  Uuid:                    GPU-aacb34feaea8812a               
  Marketing Name:          AMD Radeon RX 7800 XT              
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      4096(0x1000) KB                    
    L3:                      65536(0x10000) KB                  
  Chip ID:                 29822(0x747e)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2124                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            60                                 
  SIMDs per CU:            2                                  
  Shader Engines:          3                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 462                                
  SDMA engine uCode::      27                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1101         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1036                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      256(0x100) KB                      
  Chip ID:                 5710(0x164e)                       
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          128(0x80)                          
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   4608                               
  Internal Node ID:        2                                  
  Compute Unit:            2                                  
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       APU
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 21                                 
  SDMA engine uCode::      9                                  
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    15974268(0xf3bf7c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    15974268(0xf3bf7c) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1036         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             
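
For reference, the gfx targets can be pulled out of the rocminfo output above and compared against the architectures a PyTorch build was compiled for. This is only a sketch under assumptions: `gpu_targets` and `missing_targets` are hypothetical helper names, the sample text is trimmed from the output above, and the compiled list is made up for illustration (on a ROCm build of PyTorch, `torch.cuda.get_arch_list()` should give the real one). "HIP error: invalid device function" is commonly what shows up when the build contains no kernels for the device's target.

```python
import re

# Trimmed from the rocminfo output above (one CPU agent, two GPU agents).
SAMPLE = """\
  Name:                    AMD Ryzen 7 7800X3D 8-Core Processor
  Marketing Name:          AMD Ryzen 7 7800X3D 8-Core Processor
  Name:                    gfx1101
  Marketing Name:          AMD Radeon RX 7800 XT
      Name:                    amdgcn-amd-amdhsa--gfx1101
  Name:                    gfx1036
  Marketing Name:          AMD Radeon Graphics
"""


def gpu_targets(rocminfo_text):
    """Return the gfx target of every GPU agent, in listing order."""
    # Agent lines look like "  Name:   gfx1101"; "Marketing Name:" and the
    # "amdgcn-amd-amdhsa--gfx1101" ISA lines do not match this pattern.
    pattern = r"^\s*Name:\s+(gfx[0-9a-f]+)\s*$"
    return re.findall(pattern, rocminfo_text, flags=re.MULTILINE)


def missing_targets(device_targets, compiled_archs):
    """Return the device targets that the compiled arch list does not cover."""
    # Drop feature suffixes such as ":xnack-" before comparing.
    compiled = {arch.split(":")[0] for arch in compiled_archs}
    return [target for target in device_targets if target not in compiled]


if __name__ == "__main__":
    targets = gpu_targets(SAMPLE)
    # Hypothetical arch list, for illustration only.
    compiled = ["gfx900", "gfx1030", "gfx1100"]
    print("GPU targets:", targets)
    print("not covered by build:", missing_targets(targets, compiled))
```

With the made-up list, both gfx1101 and gfx1036 are reported as not covered. Note that two GPU agents are listed, so it may also be worth confirming which one the application actually selects.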
This error occurs with all models:


{
  "prompt_id": "a902c515-6f4b-4f69-bd65-01cb4ec73f04",
  "node_id": "NegativeCLIP_Base",
  "node_type": "CLIPTextEncode",
  "executed": [],
  "exception_message": "HIP error: invalid device function\nHIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing AMD_SERIALIZE_KERNEL=3\nCompile with \u0060TORCH_USE_HIP_DSA\u0060 to enable device-side assertions.\n",
  "exception_type": "RuntimeError",
  "traceback": [
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/execution.py\u0022, line 327, in execute\n    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/execution.py\u0022, line 202, in get_output_data\n    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/execution.py\u0022, line 174, in _map_node_over_list\n    process_inputs(input_dict, i)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/execution.py\u0022, line 163, in process_inputs\n    results.append(getattr(obj, func)(**inputs))\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/nodes.py\u0022, line 69, in encode\n    return (clip.encode_from_tokens_scheduled(tokens), )\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/sd.py\u0022, line 148, in encode_from_tokens_scheduled\n    pooled_dict = self.encode_from_tokens(tokens, return_pooled=return_pooled, return_dict=True)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/sd.py\u0022, line 210, in encode_from_tokens\n    o = self.cond_stage_model.encode_token_weights(tokens)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/sdxl_clip.py\u0022, line 60, in encode_token_weights\n    g_out, g_pooled = self.clip_g.encode_token_weights(token_weight_pairs_g)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/sd1_clip.py\u0022, line 45, in encode_token_weights\n    o = self.encode(to_encode)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/sd1_clip.py\u0022, line 252, in encode\n    return self(tokens)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1736, in _wrapped_call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1747, in _call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/sd1_clip.py\u0022, line 224, in forward\n    outputs = self.transformer(tokens, attention_mask_model, intermediate_output=self.layer_idx, final_layer_norm_intermediate=self.layer_norm_hidden_state, dtype=torch.float32)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1736, in _wrapped_call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1747, in _call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/clip_model.py\u0022, line 137, in forward\n    x = self.text_model(*args, **kwargs)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1736, in _wrapped_call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1747, in _call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/clip_model.py\u0022, line 101, in forward\n    x = self.embeddings(input_tokens, dtype=dtype)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1736, in _wrapped_call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1747, in _call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/clip_model.py\u0022, line 82, in forward\n    return self.token_embedding(input_tokens, out_dtype=dtype) \u002B comfy.ops.cast_to(self.position_embedding.weight, dtype=dtype, device=input_tokens.device)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1736, in _wrapped_call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/modules/module.py\u0022, line 1747, in _call_impl\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/ops.py\u0022, line 203, in forward\n    return self.forward_comfy_cast_weights(*args, **kwargs)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/ops.py\u0022, line 199, in forward_comfy_cast_weights\n    return torch.nn.functional.embedding(input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse).to(dtype=output_dtype)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/functional.py\u0022, line 2551, in embedding\n"
  ],
  "current_inputs": {
    "clip": [
      "\u003Ccomfy.sd.CLIP object at 0x766f88883e50\u003E"
    ],
    "text": [
      ""
    ]
  },
  "current_outputs": [
    "SaveImage",
    "EmptyLatentImage",
    "NegativeCLIP_Base",
    "VAEDecode_1",
    "Sampler",
    "CheckpointLoader_Base",
    "PositiveCLIP_Base"
  ],
  "timestamp": 1738306666190
}
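
The report above is plain JSON with escaped quotes (`\u0022`), so it can be decoded and reduced to the first error line and the innermost frame. A minimal sketch, assuming only the keys visible in the report; `summarize` is a hypothetical helper name and the sample keeps just three fields from the report:

```python
import json

# Three fields copied from the report above; the raw string keeps the
# backslash escapes exactly as they appear in the JSON.
SAMPLE = r'''{
  "node_type": "CLIPTextEncode",
  "exception_message": "HIP error: invalid device function\nHIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\n",
  "traceback": [
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/comfy/ops.py\u0022, line 203, in forward\n    return self.forward_comfy_cast_weights(*args, **kwargs)\n",
    "  File \u0022/opt/stabilitymatrix/Data/Packages/ComfyUI/venv/lib/python3.10/site-packages/torch/nn/functional.py\u0022, line 2551, in embedding\n"
  ]
}'''


def summarize(report_json):
    """Reduce an error report to node type, first error line, innermost frame."""
    report = json.loads(report_json)  # also decodes \u0022 into a plain quote
    first_line = report["exception_message"].splitlines()[0]
    last_frame = report["traceback"][-1].strip().splitlines()[0].strip()
    return {"node": report["node_type"], "error": first_line, "last_frame": last_frame}


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

In this report the innermost frame is `torch.nn.functional.embedding`, which suggests the failure happens on one of the first GPU operations of the run rather than somewhere model-specific. As the message itself warns, the stack trace of an asynchronous HIP error may not be exact.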

Other

No response

@wovynn wovynn added the Potential Bug User is reporting a bug. This should be tested. label Jan 31, 2025

wovynn commented Jan 31, 2025

The wheels in the git repo are busted.

Fixed by:

building python-torchvision-rocm and python-torchaudio-rocm from source via the AUR:
paru -S python-torchvision-rocm python-torchaudio-rocm

making a venv in ComfyUI:
cd ComfyUI && python -m venv venv && source ./venv/bin/activate

copying torchvision, torchaudio, torch, torchsde, and torchgen from /usr/lib/python3.13/site-packages to ComfyUI/venv/lib/python3.13/site-packages

Followed by pip install -r requirements.txt
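
The copy step can also be scripted. A rough sketch, assuming the package directory names listed above; `copy_packages` is a hypothetical helper, and the paths in the usage comment are the ones from this comment and will differ per system and Python version:

```python
import shutil
from pathlib import Path

# Package directories named in the steps above.
PACKAGES = ("torch", "torchvision", "torchaudio", "torchsde", "torchgen")


def copy_packages(system_site, venv_site, names=PACKAGES):
    """Copy the named package directories into the venv; return those copied."""
    copied = []
    for name in names:
        src = Path(system_site) / name
        if not src.is_dir():
            continue  # not installed system-wide, skip it
        # dirs_exist_ok=True lets the copy overwrite an earlier copy.
        shutil.copytree(src, Path(venv_site) / name, dirs_exist_ok=True)
        copied.append(name)
    return copied


# Example call (paths taken from the steps above, adjust as needed):
# copy_packages("/usr/lib/python3.13/site-packages",
#               "ComfyUI/venv/lib/python3.13/site-packages")
```

This copies only the package directories, not their `*.dist-info` metadata, so pip probably does not see torch as installed and may still try to download it when installing requirements.txt.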


goch commented Feb 2, 2025

With an RX 7600 I tried your fix. I installed the following packages via:

yay -S python-torchvision-rocm python-torchvision-rocm python-torchvision-rocm

which built the package for my GPU, gfx1100, among other architectures:

Building PyTorch for GPU arch: gfx900;gfx906:xnack-;gfx908:xnack-;gfx90a:xnack+;gfx90a:xnack-;gfx940;gfx941;gfx942;gfx1010;gfx1012;gfx1030;gfx1100;gfx1101;gfx1102
HIP VERSION: 6.2.41134-0

which resulted in the following versions:

python-torchvision-rocm 0.20.1-1
python-torchvision-rocm 2.5.1-1
python-torchvision-rocm 0.2.6-1

Purged pip cache:
pip cache purge

Copied the site packages:

cd /usr/lib/python3.12/site-packages
cp -r torchvision torchaudio torch torchgen torchsde ~/comfyUI_ws/venv/lib/python3.12/site-packages/

Installed the requirements:

pip install -r requirements.txt

which still downloads torch:
Downloading torch-2.6.0-cp312-cp312-manylinux1_x86_64.whl (766.6 MB)
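
Since pip pulls in its own torch here, it may help to check which build the venv actually imports. A small sketch, assuming `torch.version.hip` is set on ROCm builds and `None` otherwise; `describe_build` is a hypothetical helper name:

```python
import importlib.util


def describe_build(version, hip_version):
    """Describe a torch build from its version and HIP version strings."""
    if hip_version:
        kind = f"ROCm/HIP {hip_version}"
    else:
        kind = "non-ROCm (CUDA or CPU)"
    return f"torch {version}: {kind}"


if __name__ == "__main__":
    # Only import torch if it is installed in the active environment.
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
    else:
        import torch

        print(describe_build(torch.__version__, torch.version.hip))
        print("loaded from:", torch.__file__)
```

If the reported path points into the venv and the build is not a ROCm one, the downloaded wheel has likely shadowed the copied system packages.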

Starting ComfyUI:
HSA_OVERRIDE_GFX_VERSION=11.0.0 python main.py

Checkpoint files will always be loaded safely.
Total VRAM 8176 MB, total RAM 32025 MB
pytorch version: 2.5.1
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7600 : native
Using sub quadratic optimization for attention, if you have memory or speed issues try using: --use-split-cross-attention
ComfyUI version: 0.3.13
[Prompt Server] web root: /home/user/nvme/comfyUI_ws/ComfyUI/web

Import times for custom nodes:
   0.0 seconds: /home/user/nvme/comfyUI_ws/ComfyUI/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type EPS
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float32
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load SDXLClipModel
loaded completely 6886.8 1560.802734375 True

Requested to load SDXL
loaded completely 5567.33173828125 4897.0483474731445 True
  0%|          | 0/20 [00:00<?, ?it/s]
:0:rocdevice.cpp            :2984: 6985259037 us: [pid:58568 tid:0x789206fff6c0] Callback: Queue 0x789210600000 aborting with error : HSA_STATUS_ERROR_INVALID_ISA: The instruction set architecture is invalid. code: 0x100f

I also tried creating a venv with --system-site-packages. With that, pip did not download torch, but I still got the same error.

Any other ideas?
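
For checking the override value used above: as far as I understand it, HSA_OVERRIDE_GFX_VERSION takes the gfx target written as major.minor.stepping. A minimal sketch of that mapping, assuming the last two characters of the target name are hex digits for minor and stepping; `gfx_to_override` is a hypothetical helper, not part of ROCm:

```python
import re


def gfx_to_override(target):
    """Convert a gfx target name such as "gfx1100" to "major.minor.stepping"."""
    # Assumption: decimal major, then one hex digit each for minor and stepping.
    match = re.fullmatch(r"gfx(\d+)([0-9a-f])([0-9a-f])", target)
    if match is None:
        raise ValueError(f"unrecognized gfx target: {target!r}")
    major = int(match.group(1))
    minor = int(match.group(2), 16)
    stepping = int(match.group(3), 16)
    return f"{major}.{minor}.{stepping}"


if __name__ == "__main__":
    for name in ("gfx1100", "gfx1101", "gfx1030"):
        print(name, "->", gfx_to_override(name))
```

Comparing the result for the Name that rocminfo reports with the value passed in the environment shows whether the override changes the target at all, and whether the target it selects is in the list the package was built for.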
