Issue with Flash Attention on V100 GPU for Llama-3-VILA1.5-8B Model #109

Open
vedernikovphoto opened this issue Aug 2, 2024 · 8 comments

Comments

@vedernikovphoto

Hi,

I am encountering an issue when running inference on the Llama-3-VILA1.5-8B model. The error message I receive is:

RuntimeError: FlashAttention only supports Ampere GPUs or newer.

I am using a V100 GPU, which is not an Ampere GPU. Could you please provide guidance on how to disable Flash Attention for this model?

Thank you!
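
For context, recent transformers releases accept an attn_implementation argument in from_pretrained, which is the supported way to avoid the FlashAttention kernels. A minimal sketch, assuming the checkpoint can be loaded through the standard transformers entry point (VILA's own loading code may choose the attention backend internally, so this may not apply as-is):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder path; point this at the actual Llama-3-VILA1.5-8B checkpoint.
model_path = "path/to/Llama-3-VILA1.5-8B"

# attn_implementation="eager" (or "sdpa") keeps transformers from selecting the
# FlashAttention-2 kernels, which require compute capability >= 8.0 (Ampere);
# the V100 is compute capability 7.0, hence the RuntimeError above.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="eager",
).to("cuda")
```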

@NuyoaHygge

In lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py, lines 608-609, swap the annotation.
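
For anyone trying to locate this edit: my reading of the hint (an assumption, not confirmed here) is that it refers to the LLAMA_ATTENTION_CLASSES mapping in modeling_llama.py; the exact line numbers shift between transformers versions, as the later comments in this thread note. A sketch of the edit under that interpretation:

```python
# transformers/models/llama/modeling_llama.py (line numbers vary by version).
# The mapping originally reads:
#
#     LLAMA_ATTENTION_CLASSES = {
#         "eager": LlamaAttention,
#         "flash_attention_2": LlamaFlashAttention2,
#         "sdpa": LlamaSdpaAttention,
#     }
#
# Pointing "flash_attention_2" at the eager class makes any code path that
# requests FlashAttention fall back to the plain PyTorch implementation:
LLAMA_ATTENTION_CLASSES = {
    "eager": LlamaAttention,
    "flash_attention_2": LlamaAttention,  # was LlamaFlashAttention2
    "sdpa": LlamaSdpaAttention,
}
```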

@vedernikovphoto
Author

> In lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py, lines 608-609, swap the annotation.

Thanks, it works on my GPU now! However, the output is really weird; it is just a meaningless string of empty spaces and commas. I ran into the same issue with another vision-language model, while some other vision-language models work fine, so I believe this might be due to the transformers library version. I also tried running VILA on the CPU, and in that case it worked fine.

@NuyoaHygge

> In lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py, lines 608-609, swap the annotation.
>
> Thanks, it works on my GPU now! However, the output is really weird; it is just a meaningless string of empty spaces and commas. I ran into the same issue with another vision-language model, while some other vision-language models work fine, so I believe this might be due to the transformers library version. I also tried running VILA on the CPU, and in that case it worked fine.

Me, too. I've had similar issues with redundant commas and spaces. However, when I use the VILA1.5-3B model to input a video along with some questions, it actually performs better than the 8B model. Sometimes it generates coherent responses, but other times it only replies with one to three words.

@LanceLeonhart

> In lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py, lines 608-609, swap the annotation.
>
> Thanks, it works on my GPU now! However, the output is really weird; it is just a meaningless string of empty spaces and commas. I ran into the same issue with another vision-language model, while some other vision-language models work fine, so I believe this might be due to the transformers library version. I also tried running VILA on the CPU, and in that case it worked fine.

Hi, I also ran into this problem and got weird, empty outputs. If you find a way to solve it, could you please share the fix?

@liuyijiang1994

Hello, could you please tell me the specific lines? In the current version of the code, line 608 corresponds to this part, which I guess is not the lines everyone else is referring to.

[Screenshot: Snipaste_2024-09-03_15-21-08]

@liuyijiang1994

> Hello, could you please tell me the specific lines? In the current version of the code, line 608 corresponds to this part, which I guess is not the lines everyone else is referring to.
>
> [Screenshot: Snipaste_2024-09-03_15-21-08]

Oh, I see: lines 616-617.

@liuyijiang1994

> Me, too. I've had similar issues with redundant commas and spaces. However, when I use the VILA1.5-3B model to input a video along with some questions, it actually performs better than the 8B model. Sometimes it generates coherent responses, but other times it only replies with one to three words.

Same problem here: redundant commas and spaces.

@NuyoaHygge

I've given up and switched to an A-series (Ampere) GPU instead.
