When deploying ComfyUI on a fresh Windows installation using Miniconda, I encountered the "1Torch was not compiled with flash attention" warning during the initial inference. #3265
Comments
After reinstalling ComfyUI and adding about 50 plugins, I noticed that loading the model takes a long time when using the ComfyUI_VLM_nodes node. That's when I saw this warning, so I reinstalled ComfyUI without any custom nodes and found that the warning was still there. To get rid of it I even reinstalled the entire Windows system, but the warning persists.
Try installing the flash-attention 2.3.6 py311 ada / sm_89 wheel (not xformers) from my link on the discussions page you posted on (yes, it's from last December; that doesn't seem to matter). xformers builds flash-attn-2 in-tree instead of as a dependency and tosses it into an anonymously named .pyd file that nothing else can use. Torch seems to look for one in the base install via lazy module loading; the error message is just misleading. I think I'm still using 2.3.6 because I built a newer version (around 2.4.2) in February, tried installing it, and something started complaining, but xformers is on 2.5.6 now, so a newer build might work unless Torch pins a specific version. How old is your Comfy version exactly?

Unless you pass either --disable-xformers or --use-pytorch-cross-attention (which do more or less the same thing) to Comfy with xformers 0.0.25 installed, the line that errored calling torch.nn.functional.scaled_dot_product_attention should never execute; it's a version-specific path for old xformers otherwise. The tricky part is that unless you specifically run one of those two options, you still get that xformers version line and messed-up settings. Oops... it looks like there's a bug in the logic in model_management.py in Comfy that's causing PyTorch flash attention to be selected on NVIDIA
and this:
Except this happens kind of late, so all the logic earlier in the file that turns off xformers when PyTorch attention is enabled in the args (and vice versa) gets tossed, and you end up with a mix of the two enabled. I think this has been going on on my system for a while now, so I'm patching it out to check.

As for flash attention, the reason it isn't included, by the way, is that the author can't seem to write a build setup that doesn't need roughly CPU cores × 4 GB of peak RAM to compile without changing environment variables. This has almost nothing to do with Windows, but it means unbreaking their setup.py every time it changes, because the "fix" of building on one CPU core is more broken than the original situation, and IIRC it always forces builds of sm_80/compute_80 and sm_90/compute_90 whether you want them or not. Most people don't want them because they don't have $40k Hoppers lying around, and Ada is covered poorly by that combination. I think this has been fixed now in both torch and xformers.

I've also dropped other things into torch/lib that weren't default on Windows to make functionality work, like 2:4 sparsity via cuBLASLt, and somebody made a Triton v3.0.0 wheel, so that can be installed to get all the available functionality in xformers enabled aside from inter-GPU communications, which Torch fixed in the next version. Torch reports Inductor as available with Triton installed. It's an annoying system, since nobody seems to want to maintain what amounts to a makefile for some of these projects, but you can either build or find everything lying around.
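Not ComfyUI's actual code, but a rough sketch of the kind of "decide once, then stick with it" flag handling the comment says is being bypassed. The flag names come from the thread above; everything else here is an assumption for illustration only:

```python
import argparse

# Hypothetical, simplified stand-in for the arg handling described above --
# not model_management.py itself. The point is to resolve the attention
# backend exactly once so later code can't re-enable the other path.
parser = argparse.ArgumentParser()
parser.add_argument("--disable-xformers", action="store_true")
parser.add_argument("--use-pytorch-cross-attention", action="store_true")
args = parser.parse_args([])  # e.g. parse_args(["--use-pytorch-cross-attention"])

XFORMERS_ENABLED = not (args.disable_xformers or args.use_pytorch_cross_attention)
PYTORCH_ATTENTION_ENABLED = not XFORMERS_ENABLED

print("xformers:", XFORMERS_ENABLED, "| pytorch attention:", PYTORCH_ATTENTION_ENABLED)
```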
Yeah, that was forcing it to use PyTorch attention, which was using the flash-attention-2 installed on my system via Torch's lazy-as-hell loading: flash isn't even checked for when it's enabled, only when it's called and doesn't exist, so I didn't notice. 2048x2048 straight image refinement is back to being slightly faster than redoing it with hypertiling on SDXL, and I gained 0.75 it/s at 1024x1024 (for SDXL). Hypertiling is still faster when generating oddball sizes, but that's a deficiency of flash-attn shipping a set of preset-sized kernels and is easy to work around; it also accelerates the hypertiling batches (which behave better when they're power-of-two sizes and the image isn't), so it's hard to get a clean metric.

Edit: Presumably something somewhere else was bypassing xformers in the logic. Torch can use flash attention but doesn't have as advanced a kernel-selection logic, AFAICT, which is the only reason it's any slower when they run the same code. Something like the basic memory-efficient op in xformers is more like a factory call that picks a function from a larger set; I think Torch was working towards something similar. The logic for using xformers in the VAE is separate, so it gets used there no matter what. I don't have time to go over all the code right now, but something is off about the selection somewhere. Not to mention that custom nodes tend to ignore model management entirely and call the Torch version, or their own version of the quadratic attention, regardless, so they don't get any benefit, but that's a fix that needs to happen on their end.
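For anyone trying to see which path their own install is actually taking, here is a minimal diagnostic sketch (assuming PyTorch 2.x and a CUDA GPU; this is not ComfyUI code). It prints which scaled_dot_product_attention backends Torch reports as enabled, then runs one SDPA call so the "not compiled with flash attention" warning, if any, surfaces immediately:

```python
import torch
import torch.nn.functional as F

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())

# One call with typical SDXL-ish attention shapes; any backend warning is
# emitted here rather than at import time, because of the lazy dispatch.
q = k = v = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
out = F.scaled_dot_product_attention(q, k, v)
print("SDPA output shape:", tuple(out.shape))
```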
After giving up on a problem I couldn't resolve, I installed xformers. Prior to this, I had only used the command
Try downloading and installing this:
Okay, I'll try. Thank you.
But I'm still puzzled as to why I'm getting this warning with a fresh install of ComfyUI.
```
(D:\AI\ComfyUI\venv-comfyui) D:\AI\ComfyUI> pip uninstall xformers
(D:\AI\ComfyUI\venv-comfyui) D:\AI\ComfyUI> pip install C:\Users\liao1\Downloads\flash_attn-2.3.6-cp311-cp311-win_amd64.whl
(D:\AI\ComfyUI\venv-comfyui) D:\AI\ComfyUI>
D:\AI\ComfyUI>call conda activate D:\AI\ComfyUI\venv-comfyui
Prestartup times for custom nodes:
Total VRAM 8188 MB, total RAM 65268 MB
Cannot import D:\AI\ComfyUI\comfy_extras\nodes_canny.py module for custom nodes: DLL load failed while importing flash_attn_2_cuda: The specified procedure could not be found.
Cannot import D:\AI\ComfyUI\comfy_extras\nodes_morphology.py module for custom nodes: DLL load failed while importing flash_attn_2_cuda: The specified procedure could not be found.
Loading: ComfyUI-Manager (V2.17)
ComfyUI Revision: 2128 [258dbc0] | Released on '2024-04-14'
Import times for custom nodes:
WARNING: some comfy_extras/ nodes did not import correctly. This may be because they are missing some dependencies.
IMPORT FAILED: nodes_canny.py
This issue might be caused by new missing dependencies added the last time you updated ComfyUI.
Starting server
To see the GUI go to: http://127.0.0.1:8188
```
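A "DLL load failed while importing flash_attn_2_cuda" error usually means the flash-attn wheel was built against a different Torch/CUDA ABI than the one in the active environment. A quick sanity-check sketch (assuming you run it inside the same venv; the try/except is only illustrative):

```python
import torch

# Compare these against the tags of the wheel you installed
# (cp311 / the CUDA version it was built for / its torch version).
print("torch:", torch.__version__)
print("torch CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

try:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
except ImportError as e:
    # A mismatch between the wheel's build and this environment typically
    # shows up here as the same DLL load failure seen in the log above.
    print("flash_attn import failed:", e)
```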
There's an even bigger mess in that the original flash_attn lives in the same repo as flash_attn-2 but isn't built by default and is a different codebase.
Oh, I would be too; I'm just shooting in the dark here (and hopefully things can be narrowed down). I suspect kornia needs a specific flash-attention build (it calls it but doesn't list it in its requirements anywhere)... the reason that error suddenly showed up is that you uninstalled xformers, which is what it uses preferentially. I might not have been clear, but you want both installed; flash attention is a fallback for the things that require it. Most of those have the decency to build the version they need or at least put it in their requirements files. :P

I'd just reinstall xformers with `pip install xformers --extra-index-url https://download.pytorch.org/whl/cu121` to make sure you get the build with the flash-attention CUDA kernels again, and keep or remove flash-attention depending on whether it's actively breaking or fixing anything.
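After reinstalling xformers, something like the following sketch can confirm its CUDA kernels are actually usable (assumes a CUDA GPU; `python -m xformers.info` also prints a fuller per-kernel report):

```python
import torch
import xformers
import xformers.ops as xops

print("xformers:", xformers.__version__)

# memory_efficient_attention expects (batch, seq_len, heads, head_dim);
# it raises at call time if no usable CUDA kernel is found for these inputs.
q = k = v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
out = xops.memory_efficient_attention(q, k, v)
print("memory_efficient_attention OK:", tuple(out.shape))
```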
Versions of Comfy that bundled or depended on PyTorch 2.1.2 did not have this issue; the problem is solely with how PyTorch versions above that are compiled for Windows. You can get the faster PyTorch attention going again on Windows/NVIDIA by uninstalling xformers (which currently isn't even a direct dependency of Comfy) if it's present, and rolling Torch back to "2.1.2+cu121".
Warning: 1Torch was not compiled with flash attention.

First of all, some good news: this failure usually doesn't stop the program from running, it just makes it slower. The warning comes from the fact that since the torch 2.2 update, FlashAttention-2 is supposed to be used as the optimal mechanism, but it fails to start. The PyTorch 2.2 blog post https://pytorch.org/blog/pytorch2-2/ describes the major update: scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions. The usual call order is FlashAttention > Memory-Efficient Attention (xformers) > PyTorch C++ implementation (math). (I don't understand why it's designed this way, and the meaning is completely unclear from the warning; I hope the next official version improves it.) But the pitfalls I want to fix are in the following places:
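(The specific snippets this comment points to are not included in the thread.) As a generic illustration of the fallback order described above, assuming PyTorch 2.2-era APIs on a CUDA GPU, the SDPA backend can be restricted explicitly, which makes it obvious whether the flash kernels were compiled in:

```python
import torch
import torch.nn.functional as F
from torch.backends.cuda import sdp_kernel

q = k = v = torch.randn(1, 8, 512, 64, device="cuda", dtype=torch.float16)

# Flash-only: raises if flash kernels are not available in this build.
with sdp_kernel(enable_flash=True, enable_mem_efficient=False, enable_math=False):
    try:
        F.scaled_dot_product_attention(q, k, v)
        print("flash kernel available")
    except RuntimeError as e:
        print("flash kernel unavailable:", e)

# Math-only: the C++ fallback always works, just more slowly.
with sdp_kernel(enable_flash=False, enable_mem_efficient=False, enable_math=True):
    F.scaled_dot_product_attention(q, k, v)
    print("math fallback OK")
```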
Thanks for your great work. I've tried it, but the task still runs on the CPU, not the GPU, which bothers me.
Hi... I'm not sure if this is helpful, but when I ran into that issue about Torch not being compiled properly on Windows, I decided to install ComfyUI under WSL2 / Ubuntu 24.04 instead.
What helped for me was deleting the startup script in the Manager folder. For me this line was the issue: "'.', 'F:\Comfy\ComfyUI_windows_portable\python_embeded\python.exe', '-m', 'pip', 'install', 'flash_attn']"
I did a fresh installation and encountered the same issue. I've already spent three days trying to resolve it. So far, none of the methods I've tried has worked, and I also feel like the speed with the SDXL model is not as fast as before (this might just be my perception). To address this warning, I switched the system CUDA version to 12.1 and tried different versions of Torch, but the warning persists. I want to know whether this has any negative impact on my use of ComfyUI.
```
D:\AI\ComfyUI>call conda activate D:\AI\ComfyUI\venv-comfyui
Total VRAM 8188 MB, total RAM 65268 MB
xformers version: 0.0.25.post1
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4060 Laptop GPU : cudaMallocAsync
VAE dtype: torch.bfloat16
Using xformers cross attention
Import times for custom nodes:
0.0 seconds: D:\AI\ComfyUI\custom_nodes\websocket_image_save.py
Starting server
To see the GUI go to: http://127.0.0.1:8188
got prompt
model_type EPS
Using xformers attention in VAE
Using xformers attention in VAE
clip missing: ['clip_l.logit_scale', 'clip_l.transformer.text_projection.weight']
Requested to load SDXLClipModel
Loading 1 new model
D:\AI\ComfyUI\comfy\ldm\modules\attention.py:345: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:263.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load SDXL
Loading 1 new model
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:12<00:00, 1.65it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 20.37 seconds
```